Supplementary Materials

Reproducible walk-through of codes and data for Buphamalai et.al.

true , true , true , true
2021-04-01

Multiscale network complementarity

Disease modules, the hallmark concept for disease gene localization in network medicine, have been extensively studied on the PPI network. In this study, we expanded this concept across scales by incorporating additional databases that were selected to represent the multiscale biological relationships from genotypes to phenotypes. These multiscale network construction from databases of various formats including the bipartite mapping (e.g. the gene-pathway association), the ontology-based semantic similarity measurement (e.g. the Gene Ontology annotation), and correlation-based relationship extracted from quantitative studies such as GTEx expression data. These networks are complementary on different levels.

Show code
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(pacman)
p_load(patchwork, igraph, tidyverse, cowplot, rmarkdown)

# compute different network properties
if(!file.exists("../cache/network_complementarity_topological.RDS")){
  # load required functions
source("../source/network_properties_analysis.R")

} else{
  print("Load precomputed data")
  g_prop_df <- readRDS("../cache/network_complementarity_topological.RDS")
}
[1] "Load precomputed data"
Show code
network_details <- read_tsv("../data/network_details.tsv", col_types = 'ccccc')

Network details

45 Network layers from six major databases were constructed as detailed below:

Show code
paged_table(network_details %>% select(!type))

Topological complementarity

In addition to the scale comprehensiveness, these networks are also topologically complementary. A number of key network properties including node and link coverage, modularity, assortativity, and social bias, have been compared and shown below.

Social bias: many networks were constructed based on curation from literatures. The social bias of a network is assessed by the Spearman’s correlation coefficient between the network degree of a gene and the number of publications mentioning the gene. The number of publications was queried using the INDRA python module (http://www.indra.bio, accessed on 12 April 2019)

Show code
g_prop_df %>% 
  mutate(type = ifelse(grepl("coex", network), "co-expression",network)) %>%
  group_by(type, property) %>%
  summarise(value = mean(value)) %>%
  paged_table()

Note that, for the co-expression layer, the values showed on the table above are averaged from all 38 tissue-specific networks.

The plot below summarises the table properties.

Show code
# create a list of  plots to patch together
plot = list()
for(prop in unique(g_prop_df$property)){
  plot[[prop]] = g_prop_df  %>% 
    arrange(group) %>% filter(property == prop) %>%
    ggplot( aes(x=group, y=as.numeric(value))) + 
    geom_segment( aes(x=group, xend=group, y=0, yend=value), color="grey80", size=1.5) +
    geom_violin(fill="#F8B100", alpha = 0.4, color = NA) +
    geom_point( aes(color=alphaval), size=4, alpha=0.6) +
    theme_light() + 
    coord_flip() +
    theme(
      panel.grid.major.y = element_blank(),
      panel.border = element_blank(),
      axis.ticks.y = element_blank(),
     # axis.text.y = element_blank(),
    ) +
    guides(color = F)+
    xlab("") +
    scale_color_manual(values = c("#F8B100",NA)) +
    ylab(prop)
  
  # scale y log for some properties (n edges)
  if(prop %in% c("Number of edges")){
    plot[[prop]] = plot[[prop]] + scale_y_log10()
  }
  
  # for the first plot, allows axis label
  if(!prop %in% c("Number of nodes")){
    plot[[prop]] = plot[[prop]] +  theme(axis.text.y = element_blank())
  }
  
}

plot_combine = plot$`Number of nodes` + plot$`Edge density` + plot$`Global clustering` + plot$Assortativity + plot$`Social bias` + plot_layout(nrow = 1)

# uncomment to save the plot as pdf
#ggsave("../Figs/network_properties_characterisation.pdf", plot_combine, height = 2.5, width = 9)

suppressWarnings(print(plot_combine))

The Social bias of the networks

The network similarity

We quantified the similarity of a given pair of networks \(g_A \in G(V_A, E_A)\) and \(g_B \in G(V_A, E_A)\) using the edge overlap index: \[S_{AB}=\dfrac{|E_A \cap E_B|}{\text{min}(|E_A|,|E_B|)}\] We used a dissimilarity measure defined as \(d_{AB} = 1 - S_{AB}\) to construct a 2D map \(\mathbf{X} \subset \mathbb{R}^{2}\) that preserves network dissimilarities by employing Kruskal’s non-metric multidimensional scaling (R package MASS) 75. Finally, we compared the measured similarity of each network pair to random expectation: For each network, we performed 10 permutations of node indices, resulting in 100 permutations for a network pair which we used as random reference distribution to assess the measured overlap similarity. We then computed \(z\)-score and the corresponding empirical \(p\)-value. A network pair with \(p-\)value < 0.05 is considered significant.

The MDS plot derived from Jaccard and Overlap Similarity is as follows:

Show code
pacman::p_load(ggrepel, MASS)

# load the precomputed data
if(!file.exists("../cache/network_jaccard_overlap_similarity_df.RDS")){
  source("../source/compute_jaccard_similarity.R")
} else{
  print("load pre-computed network similarity data")
  network_sim_df <- readRDS("../cache/network_jaccard_overlap_similarity_df.RDS")
}
[1] "load pre-computed network similarity data"
Show code
# turn df to weight symmatrix matrix through graph
g_overlap <- graph_from_data_frame(network_sim_df[,c(1,2,4)] %>% rename(., weight = overlapindex), directed = F)

sim_overlap <- get.adjacency(g_overlap, attr = "weight")

diag(sim_overlap) = 1

#change similarity to distance
dist_overlap = 1 - sim_overlap 

############
# MDS plot normal 

#### MDS plot for Kruskal
mds<- isoMDS(as.matrix(dist_overlap), k = 2)
initial  value 46.790384 
iter   5 value 32.191184
iter  10 value 23.308862
iter  15 value 21.770579
final  value 21.704111 
converged
Show code
# a data frame of MDS values
mds_df = data.frame(x = mds$points[,1], y = mds$points[,2], network = rownames(mds$points))

# add network metadata and node size
mds_df = mds_df %>%
  left_join(., g_prop_df %>% dplyr::filter(property=="Number of nodes") %>% select(network, value)) %>%
  left_join(., network_details) %>%
  dplyr::filter(!is.na(main_type)) %>%
  mutate(label = ifelse(!grepl("coex", network), subtype, ""),
        # collabel = ifelse(!is.na(type), type, subtype)
         )

# plot the scatters of all networks
p <- mds_df %>% 
  ggplot() + 
  geom_point(aes(x, y, col = main_type, size = value), alpha = 0.5) + 
  geom_text_repel(aes(x, y, label = label)) + 
  theme_cowplot() +theme(
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank(),
    axis.ticks.y=element_blank(),
    axis.text.y=element_blank()) + 
  xlab("MDS1") + ylab("MDS2") +
  scale_color_manual(values = c("#F8B100", "#005564"))+
  guides(col = F, size  = F)


p
Show code
#ggsave("../Figs/scatter_Network_complementarity_MDS_Overlap.pdf",plot = p, width = 4, height = 4)

Summary statistics of the network similarity

Overlap index:

Show code
network_sim_df %>% pull(overlapindex) %>% summary
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0001835 0.0121798 0.0331371 0.0419274 0.0543372 0.4899776 

Median overlap indices between co-expression networks are 0.0433192 and non co-expression networks are 0.0183954.

Show code
## Jaccard heatmap

## remove coex_core from the map
p1 = network_sim_df[,1:3] %>% dplyr::filter(!grepl("core", V1), !grepl("core", V2)) %>% 
  # heatmap plot
  ggplot() + geom_tile(aes(x = V1, y = V2, fill = jaccardIndex)) + scale_fill_distiller(direction = 1) + xlab("") +ylab("") + ggtitle("Jaccard similarity among all networks") + theme_minimal()+ theme(axis.text.x = element_text(angle = 90))

p1

We use core transcriptional modules to represent all of the co-expression network. The heatmap below shows the Jaccard and Overlap similarity.

Show code
## remove coex_core from the map
considered_networks = c("coex_core", "reactome_copathway", "ppi", "MP", "HP", "GOMF", "GOBP")

labels = c("co-expression", "co-pathway", "PPI", "MP", "HP", "GOMF", "GOBP ")

p1 = network_sim_df[,1:3] %>% dplyr::filter(V1 %in% considered_networks, V2 %in% considered_networks ) %>% 
  # rescale factor
  mutate(V1 = factor(V1, levels = considered_networks, labels = labels),
         V2 = factor(V2, levels = considered_networks, labels = labels)) %>%
  # heatmap plot
  ggplot() + geom_tile(aes(x = V1, y = V2, fill = jaccardIndex)) + scale_fill_distiller(direction = 1) + xlab("") +ylab("") + ggtitle("Jaccard similarity") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p1

Show code
overlap_df <- network_sim_df[,c(1,2,4)] %>% dplyr::filter(V1 %in% considered_networks, V2 %in% considered_networks ) %>% 
  # rescale factor
  mutate(V1 = factor(V1, levels = considered_networks, labels = labels),
         V2 = factor(V2, levels = considered_networks, labels = labels))

  # heatmap plot
p2 = ggplot(overlap_df) + geom_tile(aes(x = V1, y = V2, fill = overlapindex)) + scale_fill_distiller(direction = 1) + xlab("") +ylab("") + ggtitle("overlap similarity ") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust=1))

p2

Show code
sprintf("average overlap index for all networks are %f", mean(overlap_df$overlapindex))
[1] "average overlap index for all networks are 0.060363"

Network similairty randomisation

For a given pair of network \(g_A \in G(V_A, E_A)\) and \(g_B \in G(V_A, E_A)\), we computed edge similarity through overlap index:

\[S_{AB}=\dfrac{|E_A \cap E_B|}{\text{min}(|E_A|,|E_B|)}\]

Foe each network, we performed 10 permutations of node indices, resulting in 100 permutations for a network pair, wehere we obtained the reference distribution for their similarity. We then computed \(z\)-score and the corresponding empirical \(p\)-value. A network pair with \(p-\)value < 0.05 is considered significant (../source/network_overlap_randomisation.R)

Show code
pacman::p_load(tidyverse, patchwork, ggrepel, cowplot)

# Perform the randomisation, or load from cahe (recommend)
source("../source/network_overlap_randomisation.R")
[1] "load the precomputed value from cache"
Show code
# network similarity values
network_sim_df <- readRDS("../cache/network_jaccard_overlap_similarity_df.RDS") %>% 
  dplyr::filter(!grepl("core", V1), !grepl("core" , V2))

# edge counts
ecounts <- readRDS("../cache/network_complementarity_topological.RDS")%>% 
  dplyr::filter(property=="Number of edges") %>% 
  pull(value, name = network) 

# compute minimum edge size for each pair
network_sim_df$min_ecount <- apply(network_sim_df, 1, function(x) min(ecounts[x[1]], ecounts[x[2]]))
Show code
# process jaccard and overlap index
jaccard_index <- lapply(randomisation_overlap_result, function(x) x$intersect/x$union)
overlap_index <- lapply(1:length(randomisation_overlap_result), function(x) 
  randomisation_overlap_result[[x]][['intersect']]/network_sim_df$min_ecount[x])

# compute mean and sd, zscore and pvalue
network_sim_df <- network_sim_df %>% 
  mutate(jaccard.mean = sapply(jaccard_index, mean),
         jaccard.sd = sapply(jaccard_index, sd),
   
         overlap.mean = sapply(overlap_index, mean),
         overlap.sd = sapply(overlap_index, sd),
         
         jaccard.zscore = (jaccardIndex-jaccard.mean)/jaccard.sd,
          jaccard.pval = pnorm(jaccard.zscore, lower.tail = F),
               
          overlap.zscore = (overlapindex-overlap.mean)/overlap.sd,
          overlap.pval = pnorm(overlap.zscore, lower.tail = F)
  )


# label p values into classes
network_sim_df <- network_sim_df %>% 
  # make p values in groups
  mutate(overlap.pval_level = cut(overlap.pval, 
                                  breaks = rev(c(1,5e-2, 1e-3, 1e-4, 1e-5, 0)), 
                                  labels = rev(c("ns","*","**","***","****")), 
                                  include.lowest = T, ordered_result = T),
         # compute whether the pair are from both co-expression, or non co-expressions
         type1 = grepl("coex", V1),
         type2 = grepl("coex", V2),
         pair_name = paste(str_remove(V1,"coex_|reactome_"), 
                           str_remove(V2,"coex_|reactome_"), sep = " - "),
         # label only if overlap score higher than 0.2
         pair_label = ifelse(overlapindex > 0.2, pair_name, ""),
         type_pair = factor(type1+type2, levels = 0:2, labels = c("non.coex - non.coex",
                                                                  "coex - non.coex",
                                                                  "coex - coex"))) %>%
  mutate(overlap.pval_level = factor(overlap.pval_level, levels = rev(levels(overlap.pval_level)))) 


network_sim_df %>% count(overlap.pval_level) %>% paged_table()

Despite their wide range of similarity scores, we found that 955 out of 990 network pairs (96.5%) are significantly more similar than random expectation.

Show code
# stable plot results
p_scatter = ggplot(network_sim_df, aes(x = overlapindex, 
                                       y = log2(overlapindex/overlap.mean))) +
  geom_point(aes(col = overlap.pval_level)) + 
  scale_colour_viridis_d(direction = -1) + 
  theme_minimal() +
  xlab(expression(Similarity~(S[AB]))) + labs(title = "Network pair similarity", col = "significance") + 
  ylab(expression(log[2](S[AB]/mu[S[AB]]))) 

# plot by type 
p_scatter_by_type <- p_scatter + facet_grid(. ~ type_pair) + geom_text_repel(aes(label = pair_label))

p_scatter

Show code
pval_lv_count_df <- network_sim_df %>%  count(overlap.pval_level, name = "count")
pval_lv_count_by_type_df  <- network_sim_df %>%  
  count(overlap.pval_level, type_pair, name = "count") %>%
  group_by(type_pair) %>%
  mutate(prop = count/sum(count))

p_count_by_type = ggplot(pval_lv_count_by_type_df, aes(x = overlap.pval_level, y = prop)) + 
  geom_col(aes(fill = overlap.pval_level)) + 
  scale_fill_viridis_d(direction = -1) + 
  xlab("Significance") + ylab("proportion") + guides(fill = F) +
  theme_minimal() + facet_grid(. ~ type_pair) + labs(title = "Network similarity significance level")


p_scatter_by_type/p_count_by_type

Interestingly, we also observed that networks on different scales (i.e. among non co-expression layers) are all significantly similar, showing that there are key edges being maintained across genotype to phenotype.

Co-expression network characterisation

Core- and tissue-specific co-expression

We characterised our tissue-specific co-expression networks based on GTEx. Our hypothesis is that genes that are highly co-expression across all tissues are likely required for cellular developments and survival, and should show a strong correlation with of essentiality. In this analysis, we downloaded the list of human essential genes from the OGEE database (v2), also included in /data/OGEE_esential_genes_20190416.txt.

Show code
## coexpresssion - share edges
## goal: to observe whether the shared edges among co-expression networks are essential

# 0 - load required data
## load coexel sum
library(pacman)
p_load(tidyverse, cowplot, knitr)

coex_el_sum_grouped = readRDS("../cache/coexpression_edge_counts_by_group.RDS")

# create a vector of total probability for each class
coex_el_sum_score  = coex_el_sum_grouped %>% 
  ungroup() %>% 
  group_by(essential_edge_score) %>% 
  summarise(count = sum(n)) %>% pull(count, name = essential_edge_score)

coex_el_sum_grouped %>% 
  dplyr::rename(n_tissues = n_binned_relabel) %>%
  group_by(n_tissues) %>%
  summarise(n_edges = sum(n), percent =  sum(n)*100/sum(coex_el_sum_score)) %>%
  kable
n_tissues n_edges percent
0-5 12056143 91.8978690
5-10 747278 5.6961215
10-15 209979 1.6005635
15-20 70745 0.5392533
20-25 23372 0.1781529
25-30 7347 0.0560025
30-38 4203 0.0320373

The plot for higher essentiality aas number of genes increased is shown below.

Show code
bar_essential = ggplot(coex_el_sum_grouped) + 
  geom_bar(aes(x = n_binned_relabel, y = n, fill = score), stat = "identity",  position="fill") + 
  xlab("Number of tissues") + ylab("Edge proportion") + theme_cowplot() + 
  theme(legend.position = "bottom", axis.text.x = element_text(size = 10)) + 
  guides(fill = guide_legend(title = "Essential gene in edge")) + scale_fill_manual(values = c('grey60','#9ecae1','#3182bd'))

bar_essential
Show code
#ggsave("./Figs/essentiality_co-expression.pdf", bar_essential, width = 4, height = 4)
Show code
coex_el_sum = readRDS("../cache/coexpression_raw_edge_counts.RDS") %>% ungroup
coex_el_sum_by_tissue <- coex_el_sum %>% 
  count(n, essential_edge_score, name = "count")

OrphaNet Rare Disease association and analysis

Cross-scale network signatures of rare diseases

The structure of Orphanet Rare Disease Ontology was queried and processed using R interface of the Ontology Lookup Service (https://lgatto.github.io/rols/index.html). A number of calculation per-computed for further analyses on this section was performed in source/Orphanet_annotate_genes_to_ancestors.R.

Show code
# load direct gene association
orpha_gene_onset_df <- readRDS("../cache/orpha_gene_onset_df.RDS")

# disease gene association with roots
orphanet_gene_association <- read_tsv("../data/orphaNet_disease_gene_association_with_roots.tsv")

# disease gene association at group level
source("../functions/readdata_functions.R")
gene_disease_orpha = process_disease_genes_data("../data/table_disease_gene_assoc_orphanet_genetic.tsv", 1, 2000)
[1] "read 28 diseases, of total 3593 associated genes."
Show code
#source("../source/read_orphanet_gene_association_data.R")
gene_disease_orpha = gene_disease_orpha$disgene_df

# Modify and merge data
orpha_gene_display_df <- orphanet_gene_association %>%
  dplyr::filter(n_genes > 0) %>%
  mutate(ID = as.double(str_remove(orphaID, "Orphanet:"))) 

DT::datatable(orpha_gene_display_df[,c("ID", "label", "n_genes", "genes")] ,
                extensions = 'Buttons',
                options = list(dom = 'Blfrtip',
                               buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
                               lengthMenu = list(c(10,25,50,-1),
                                                 c(10,25,50,"All"))))

Rare diseases are scarcely annotated, and most disease terms (2686 out of 3771) are only associated with one gene. Network-based measurements for individual diseases are unfeasible and grouping of the terms for higher level association are necessary.

Show code
# from orphanet_mapping_top_branch
gene_per_disease = orpha_gene_onset_df %>% 
  dplyr::filter(!is.na(gene)) %>% 
  count(orphaID)

gene_per_disease_count = gene_per_disease %>% 
  mutate(group = cut(n, breaks = c(0:10, 100), labels = c(1:10, "> 10"))) %>% 
  count(group)

p = ggplot(gene_per_disease_count, aes(x = group, y =n)) + geom_col() +
  ggtitle("Most orphanet diseases are immediately associated with one gene") +
  theme_minimal() + ylab("Number of diseases") + xlab("Number of genes per disease")#+ scale_y_log10()

plotly::ggplotly(p)
Show code
pacman::p_load(cowplot)
#gene_per_disease_group = gene_disease_orpha %>% 
 # mutate(group = cut(n_genes, breaks = c(seq(0,100,20), seq(200,1000, 200), Inf))) %>% count(group)

#ggplot(gene_per_disease_group, aes(x = group, y =n)) + geom_col() +
#  ggtitle("Most orphanet diseases are immediately associated with one gene") +
#  theme_minimal() + ylab("Number of diseases") + xlab("Number of genes per disease")#+ scale_y_log10()

# number shift
gene_per_disease_group = gene_disease_orpha %>% 
  mutate(disease = "Grouped", n = N) %>% 
  select(disease, n)

gene_per_disease = gene_per_disease %>% 
  mutate(disease = "Individual")

gene_per_disease_both = gene_per_disease %>% 
  select(disease, n) %>%
  bind_rows(., gene_per_disease_group) %>% 
  dplyr::filter(n>0) %>%
  # relevel factor
  mutate(disease = factor(disease, levels = c("Individual","Grouped")))

# add break values for plotting on log scale on x axis
breakvals = c(0:9, seq(10,90,10), seq(100,1000,100), 2000)

gene_per_disease_both_count = gene_per_disease_both %>% 
  mutate(group = cut(n, breaks = breakvals, include.lowest = T, labels = breakvals[-1])) %>%
  group_by(disease, group) %>% 
  summarise(n = n()) %>% 
  mutate(prop = n/sum(n)) %>% 
  full_join(., tibble(group = as_factor(breakvals[-1])))


gene_per_disease_both_count$disease[is.na(gene_per_disease_both_count$disease)] = "Grouped"
gene_per_disease_both_count$n[is.na(gene_per_disease_both_count$n)] = 0
# plot
#ggplot(gene_per_disease_both_count, aes(y = prop, x = group)) + geom_col() +  theme_minimal() +facet_grid(disease ~ .)

p = ggplot(gene_per_disease_both_count, aes(y = n, x = group)) + 
  geom_col() +  
  facet_grid(disease ~ ., scales = "free") +
  #scale_x_discrete(guide = guide_axis(check.overlap = TRUE)) +
  scale_x_discrete(breaks = c(1,10,100,1000))+
  geom_vline(xintercept = which(levels(gene_per_disease_both_count$group)==20), linetype = "dashed", col = "red") + theme_cowplot() + xlab("Genes per disease term") + ylab("Number of terms")

p
Show code
#ggsave("../Figs/orphanet_individual_vs_grouped_diseases_bar.pdf", p, width = 3*1.5, height = 2*1.5)

Based on the plot above, accumulating gene-disease association for descendant disease terms of ‘Rare genetic disease’ ( “Orphanet:98053”) resulting in physiologically distinct disease groups where the majority (26 out of 28) groups are associated with sufficient amount of genes for module detection (\(n=20\)).

The disease gene association can be found in data/table_disease_gene_assoc_orphanet_genetic.tsv. The summary of disease groups and all associated genes are shown below.

Show code
rmarkdown::paged_table(gene_disease_orpha %>% select(name, N) %>% arrange(-N)) 
Omitting disease groups with fewer than 20 associated genes, the number of terms (out of 3771) and genes excluded from further analyses include:
Show code
# roots that are removed
removed_roots <- gene_disease_orpha %>% filter(N<20) %>% pull(name)

# table for removed terms
removed_disease_terms <- orphanet_gene_association %>%
  mutate(n_roots = str_count(roots, ";")+1,
         removed_roots = grepl(removed_roots[1], roots) + grepl(removed_roots[2], roots),
         remained_roots = n_roots - removed_roots) %>%
  filter(remained_roots == 0) %>%
  select(orphaID, label, genes, roots)

removed_genes <- sapply(removed_disease_terms$genes, function(x) str_split(x, ";")) %>% unlist %>% unique()

knitr::kable(removed_disease_terms)
orphaID label genes roots
Orphanet:199241 Pulmonary capillary hemangiomatosis EIF2AK4 Rare genetic respiratory disease
Orphanet:210122 Congenital alveolar capillary dysplasia FOXF1 Rare genetic respiratory disease
Orphanet:217566 Chronic respiratory distress with surfactant metabolism deficiency SFTPC Rare genetic respiratory disease
Orphanet:217563 Neonatal acute respiratory distress due to SP-B deficiency SFTPB Rare genetic respiratory disease
Orphanet:440402 Interstitial lung disease due to ABCA3 deficiency ABCA3 Rare genetic respiratory disease
Orphanet:440392 Interstitial lung disease due to SP-C deficiency SFTPC Rare genetic respiratory disease
Orphanet:444092 Autoimmune interstitial lung disease-arthritis syndrome COPA Rare genetic respiratory disease
Orphanet:31837 Pulmonary venoocclusive disease BMPR2;EIF2AK4 Rare genetic respiratory disease
Orphanet:100051 Hereditary angioedema type 2 SERPING1 Serpinopathy
Orphanet:100050 Hereditary angioedema type 1 SERPING1 Serpinopathy

There are 10 disease terms whose associated genes are not associated with any other disease groups, and hence these 8 genes are omitted.

Tree map for ontology representation of rare diseases

Show code
# download the association
library(RColorBrewer)

orphanet_gene_association_unique_root <- orphanet_gene_association %>%
  separate_rows(., roots, sep = ";", convert = T) %>% dplyr::filter(!is.na(roots)) %>% 
  mutate(roots = as.factor(roots))

## take a function to allow modifying alpha values
# Add an alpha value to a colour
add.alpha <- function(col=NULL, alpha=1){
  if(missing(col))
    stop("Please provide a vector of colours.")
  apply(sapply(col, col2rgb)/255, 2, 
                     function(x) 
                       rgb(x[1], x[2], x[3], alpha=alpha))  
}

# define colours for all disease groups
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nrow(gene_disease_orpha))

# add id 1-28
gene_disease_orpha$id = 1:nrow(gene_disease_orpha)

# add colour with corresponding alpha values to each disease group
gene_disease_orpha_mod <- gene_disease_orpha %>% 
  rowwise() %>%
  mutate(col =mycolors[id])
  #mutate(col = add.alpha(mycolors[id], N/max(gene_disease_orpha$N)))

gene_disease_orpha_mod$root = "Orphanet"

# load voronoi treemap package
if(!"voronoiTreemap" %in% rownames(installed.packages())){
  pacman::p_load_gh("https://github.com/uRosConf/voronoiTreemap")
} else{
  library(voronoiTreemap)
}


gene_disease_orpha_mod$plotlab = str_remove_all( gene_disease_orpha_mod$name, "Rare |genetic | disease| syndrome| disorder")
onto_json <- vt_export_json(vt_input_from_df(gene_disease_orpha_mod, hierachyVar0 = "root", hierachyVar1 = "name", hierachyVar2 = "name", colorVar = "col", weightVar = "N", labelVar = "plotlab"))

vt_d3(onto_json)

Patients level prioritization

In clinical settings, an approximate number of X potential causal variants are obtained after a rare disease patient underwent exome sequencing. With additional stringent pathogenecity evaluation, x% of those variants are filtered out. Yet, there remains 100-x% which regarded as high confidence [cite]. We earlier showed that our informed multiplex propagation is a powerful tool for improving the retrieval rate of disease causal genes. We propose that it can also act as an additional evaluation metric for patients’ variant prioritization. As a test scenario, we obtained variant lists from five rare disease patients with neurological symptoms. The causal variant for each patient has already been validated and we can therefore use them for benchmarking.

Show code
variants <- read_csv("../data/variants_list/patient_01.csv")

# remove NA entries that may have created during file reading
variants <- variants[!is.na(variants$SYMBOL),]

knitr::kable(variants)
SYMBOL CHROM POS score ID Causative
TAS2R43 12 11244734 6.318 ENST00000531678.1:c.95C>G NO
SULT1A1 16 28617494 11.270 ENST00000395609.1:c.658G>A NO
ZNF414 19 8575770 9.448 ENST00000393927.4:c.1064C>T NO
JADE3 X 46918243 16.040 ENST00000218343.4:c.2236T>C NO
CCDC22 X 49105278 14.930 ENST00000376227.3:c.1432A>C NO
HUWE1 X 53561062 32.000 ENST00000342160.3:c.12928G>C YES
FAAH2 X 57515286 17.560 ENST00000374900.4:c.1520T>G NO
ABCB7 X 74291377 15.380 ENST00000253577.3:c.1177A>G NO
GPR119 X 129518716 21.500 ENST00000276218.2:c.706C>G NO

This patient are diagnosed, i.e. has a confirmed causal variant. We wonder whether the causal gene can be picked up by tha algorithm. The patient are presented with symptoms belonging to rare genetic neurological disorder, and we used the 10-fold CV retrieval result for this disease group to rank the genes based on their average visiting probability. We incorporated additional columns to the variant sets: the average visiting probability (from 10-fold CV), the corresponding global rank (from the entire genome) and the local rank (rank of the gene for the paitent).

Show code
disease_group <- "Rare genetic neurological disorder"

# result from using significant networks
rank_df_all_folds <- readRDS("../cache/LCC_CV_ranking_rare_genetic_diseases_no_LCC_recompute.RDS")

combn_visiting_prob <- list()
for(i in names(rank_df_all_folds)){
  combn_visiting_prob[[i]] <- rank_df_all_folds[[i]] %>%
    bind_rows() %>%
    group_by(GeneName, trueset) %>%
    summarise(avg_prob = mean(avg), .groups =  "drop") %>%
    ungroup() %>%
    arrange(-avg_prob) %>%
    mutate(rank = 1:n())
}


rank_df_disease <- combn_visiting_prob[[disease_group]]

variants_annotated <- left_join(variants, rank_df_disease, by = c("SYMBOL"="GeneName")) %>%
  select(-trueset) %>%
  rename(GlobalRank = rank) %>%
  arrange(GlobalRank) %>%
  mutate(LocalRank = 1:n())

knitr::kable(variants_annotated)
SYMBOL CHROM POS score ID Causative avg_prob GlobalRank LocalRank
HUWE1 X 53561062 32.000 ENST00000342160.3:c.12928G>C YES 5e-07 208 1
CCDC22 X 49105278 14.930 ENST00000376227.3:c.1432A>C NO 2e-07 1919 2
ABCB7 X 74291377 15.380 ENST00000253577.3:c.1177A>G NO 2e-07 2598 3
ZNF414 19 8575770 9.448 ENST00000393927.4:c.1064C>T NO 2e-07 4861 4
JADE3 X 46918243 16.040 ENST00000218343.4:c.2236T>C NO 1e-07 8222 5
FAAH2 X 57515286 17.560 ENST00000374900.4:c.1520T>G NO 0e+00 11710 6
SULT1A1 16 28617494 11.270 ENST00000395609.1:c.658G>A NO 0e+00 12267 7
GPR119 X 129518716 21.500 ENST00000276218.2:c.706C>G NO 0e+00 17956 8
TAS2R43 12 11244734 6.318 ENST00000531678.1:c.95C>G NO 0e+00 20124 9

We observed our algorithm picked a variant on HUWE1, which was confirmed to be the causal variant.